# Image Understanding
Qwen2 VL 2B GGUF
Apache-2.0
A quantized GGUF build of the Qwen2-VL-2B vision-language model.
Text-to-Image
Transformers English

tensorblock
314
0
Llava Critic 7b Hf
A Transformers-compatible vision-language model with image understanding and text generation capabilities.
Text-to-Image
Transformers

FuryMartin
21
1
Llava Saiga 8b
Apache-2.0
LLaVA-Saiga-8b is a vision-language model (VLM) built on the IlyaGusev/saiga_llama3_8b model, primarily optimized for Russian-language tasks while retaining English capabilities.
Image-to-Text
Transformers Supports Multiple Languages

deepvk
205
16
Llava Calm2 Siglip
Apache-2.0
llava-calm2-siglip is an experimental vision-language model capable of answering questions about images in Japanese and English.
Image-to-Text
Transformers Supports Multiple Languages

cyberagent
3,930
25
Paligemma 3B Chat V0.2
A multimodal dialogue model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn conversation.
Text-to-Image
Transformers Supports Multiple Languages

BUAADreamer
80
9
Paligemma Vqav2
This model is a fine-tuned version of google/paligemma-3b-pt-224 on a subset of the VQAv2 dataset, specializing in visual question answering tasks.
Text-to-Image
Transformers

merve
168
13
Llava Llama 3 8b V1 1 GGUF
A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.
Image-to-Text
MoMonir
138
5
Llava Phi 3 Mini Hf
A LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.
Image-to-Text
Transformers

xtuner
2,322
49
Blip Finetuned Fashion
Bsd-3-clause
A visual question answering model fine-tuned from Salesforce/blip-vqa-base, specializing in the fashion domain.
Text-to-Image
Transformers

Ornelas
2,281
0
Eris PrimeV3 Vision 7B
Other
Eris Prime V2 is a 7B-parameter multimodal language model with vision capabilities; it requires Koboldcpp to run.
Text-to-Image
ChaoticNeutrals
118
8
Candle Llava V1.6 Mistral 7b
Apache-2.0
LLaVA is a vision-language model capable of understanding and generating text related to images.
Image-to-Text
DanielClough
73
0
Tecoa4 Clip
MIT
TeCoA is a vision-language model initialized from OpenAI CLIP and made more robust through supervised adversarial fine-tuning.
Text-to-Image
chs20
51
1
Llava V1.6 Vicuna 13b Gguf
Apache-2.0
LLaVA is an open-source multimodal chatbot based on the Transformer architecture, offered here in several quantized versions that trade off size against quality.
Image-to-Text
cjpais
630
9
Ggml Llava V1.5 7b
Apache-2.0
LLaVA is a vision-language model capable of understanding and generating text related to images.
Image-to-Text
y10ab1
44
2
Pix2struct Vizwizvqa Base
Apache-2.0
An English-language visual question answering model, released under the Apache-2.0 license.
Text-to-Image
Transformers English

nanom
16
0
Llava V1.5 13B GPTQ
LLaVA v1.5 13B is a multimodal model developed by Haotian Liu that combines visual and language capabilities to understand images and text and generate content from them.
Text-to-Image
Transformers

TheBloke
131
37
Mplug Owl Llama 7b
Apache-2.0
mPLUG-Owl is a multimodal large language model based on the LLaMA-7B architecture, supporting image understanding and text generation tasks.
Image-to-Text
Transformers English

MAGAer13
327
16
Taiyi BLIP 750M Chinese
Apache-2.0
A model that generates text descriptions of image content, with support for Chinese.
Text Recognition
Transformers Chinese

IDEA-CCNL
180
14
Beitbase
A BEiT base model fine-tuned on an unknown dataset; specific use cases and performance details are currently unavailable.
Large Language Model
Transformers

ivensamdh
15
0
Upernet Convnext Large
MIT
UperNet is a semantic segmentation framework, paired here with a ConvNeXt large backbone to predict a semantic label for each pixel.
Image Segmentation
Transformers English

openmmlab
23.09k
0